System and Method for Creating a Scalable Monolithic Packet Processing Engine

ABSTRACT

A novel and efficient method is described that creates a monolithic high capacity Packet Engine (PE) by connecting N lower capacity Packet Engines (PEs) via a novel Chip-to-Chip (C2C) interface. The C2C interface is used to perform functions, such as memory bit slicing and to communicate shared information, and enqueue/dequeue operations between individual PEs.

BACKGROUND OF THE INVENTION

Packet Engines (PEs) are devices or Integrated Circuits (ICs or Chips) that perform packet processing, such as packet classification, policing, filtering and switching. In any given technology, there is always a practical limit in how fast a monolithic PE can be built. In order to build a higher capacity system, a number of PEs can be joined together. The traditional way of doing this is to use Modular Systems that join a number of individual PEs using a central packet switch, often implemented in the form of a packet backplane with central switch fabric, and the PEs sitting on line-cards interfacing to the backplane switch. Although Modular Systems allow the construction of very large switching systems, they can no longer be considered a “monolithic non-blocking” switch, because in these large systems, the introduction of the central fabric always introduces QoS or performance limitations with certain traffic patterns. In addition, the ability to perform shared operations across a Modular System, such as policing and protection switching on different interfaces sitting on different PEs, is lost.

This invention uses a unique design that allows two or more PEs to be joined together, while keeping the monolithic non-blocking feature-set. The Bandwidth (BW) in terms of BPS (Bits Per Second), the processing power in terms of PPS (Packets Per Second) and the number of interfaces are increased by a factor of “N”, where “N” is the number of PEs joined together. Given the assumption that for a given technology one can only build PEs with capacity X, using this technique, (Multi-chip) PEs with capacity of N*X can be built.

SUMMARY OF THE INVENTION

This invention describes a novel design that can create a monolithic high capacity Packet Engine, called NPE, by connecting N lower capacity Packet Engines (PEs) via a Chip-to-Chip (C2C) interface. The C2C interface is used to perform functions, such as memory bit slicing to store packets in a distributed manner in the memory of individual PEs and to communicate shared information, such as enqueue/dequeue operations between them. This technique is a very efficient method of creating a powerful PE with higher capacity than a single PE can obtain. For certain cases, e.g. N=2, it is also possible to obtain a form of redundancy where the dual device operation can be gracefully degraded to single PE operation (and single PE performance) in case of a hardware failure. If this is coupled with the use of certain link protection protocols such as Ethernet Link Aggregation with the links being spread over the two constituent PEs, traffic can be maintained in case of a hardware failure, but at a reduced performance level.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic diagram of a Modular System with two line-cards and a central switch fabric.

FIG. 2 is a schematic diagram of a single stand-alone Packet Engine with traffic interfaces.

FIG. 3 is a functional diagram of a single Packet Engine (PE) operation.

FIG. 4 is a functional diagram of a dual-PE operation.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Carrier-class switches are typically constructed using building blocks such as NPUs (Network Processing Units) and TMs (Traffic Managers). These two building blocks are often combined into a single integrated device (IC or chip), including one or more traffic interfaces (e.g. Ethernet ports). Such an IC is suitable for building carrier-class networks and is called “Packet Engine” (PE) in this document.

Packet engines perform operations like:

-   Packet classification into different flows with different Quality of Service (QoS)
-   Termination and handling of various communication protocols such as Multi-protocol Label Switching (MPLS)
-   Switching and routing
-   Ability to partition network into Virtual Private Networks
-   Admission control of individual flows using shared or dedicated Policers
-   Discard of excess traffic, according to their QoS, using techniques such as Weighted Random Early Discard (WRED)
-   Queuing and scheduling system
-   Operation and Management (OAM) functions
-   Protection switching mechanisms to perform fast recovery in case of network problems

A PE is characterized by the fact that it can perform all of the above functionality across all of its available interfaces with no restrictions on how traffic flows between the ports. However, PE performance is also characterized by a set of basic parameters; two of the most important are the Bandwidth (BW) supported in terms of BPS (Bits Per Second) and the packet processing capability in terms of number of PPS (Packets Per Second) the PE can handle. Since PEs often need large buffer storage for the queue system (typically in external memory called RAM), the maximum BW supported by a device is very often limited by the BW of the RAM used to construct the buffer system. The packet processing capability in terms of PPS is limited by the ability to perform table lookups and packet classifications, often using a mixture of both internal and external RAM.

There is always an interest in being able to construct a higher performance and larger PE to handle the ever increasing bandwidth requirements, while maintaining the service model offered by a single non-cascaded monolithic device for higher bandwidths. Using even the latest Integrated Circuit (IC) technologies, there is always a practical limit on the performance of a monolithic PE that can be built. The only way to build a higher performance PE is to join a number of monolithic PEs together. The traditional way of doing this is using a Modular System, as shown in FIG. 1. In a Modular System such as the one shown in FIG. 1, a number of individual PEs (130, 160) are connected to a central packet switch fabric (150), often implemented in the form of a packet backplane with central switch fabric chips residing on the backplane and the PEs sitting on line-cards (110, 120), interfacing to the backplane switch either directly or through fabric interface chips (140, 170). Modular Systems allow the construction of very large (Tera-bit per second) switching systems, but they also have a number of drawbacks. Most importantly, the Modular Systems can no longer be considered a “monolithic non-blocking” switch. They behave more like a system of individual switches connected by an (internal) network, providing some form of Quality of Service (QoS). The introduction of the central fabric always introduces QoS or performance limitations with certain traffic patterns. In these large systems, the ability to perform shared operations across multiple PEs on different line-cards, such as policing or protection switching, is also lost.

Furthermore, the interface to the backplane switch fabric consumes bandwidth in the PE, and as such, behaves like any other traffic interface on the device. FIG. 2 shows a stand-alone PE device. If the stand-alone PE (220) in FIG. 2 can process a total of X Gbps (Gigabits per second) BW across all its external interfaces (230, 240), the same PE (130, 160) will only be able to handle X/2 Gbps BW on its external interfaces (100, 101) in a Modular System, because the other X/2 Gbps BW is consumed by the internal fabric interface (180, 190).

This invention uses a unique design that allows 2 or more PEs to be joined together, while keeping the monolithic non-blocking features. The Bandwidth in terms of BPS, the packet processing power in terms of PPS, and the number of external interfaces can be increased by a factor of N, where N is the number of PEs joined together. This novel design enables the creation of monolithic switches that are N times more powerful than the individual PEs they are constructed from. For the rest of this document, a monolithic PE constructed from N PEs is called “NPE”. For example, a monolithic PE constructed from two PEs is called 2PE. A 2PE device, also called a dual-PE, is of special interest, since its implementation is very straightforward and less complex. The rest of this section describes the 2PE device, but it is equally applicable to NPE devices, as well.
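The bandwidth comparison made in the two preceding paragraphs can be written out as a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the NPE figure simply applies the factor of N stated above, without modelling any C2C overhead.

```python
def modular_external_bw_gbps(x_gbps: float, n: int) -> float:
    """External bandwidth of a Modular System built from n PEs.

    Each PE of capacity x_gbps spends half of it on the fabric
    interface, leaving x_gbps / 2 for external interfaces.
    """
    return n * x_gbps / 2


def npe_external_bw_gbps(x_gbps: float, n: int) -> float:
    """External bandwidth of an NPE built from n PEs (factor of N, as stated)."""
    return n * x_gbps


if __name__ == "__main__":
    x, n = 100.0, 2  # hypothetical example values
    print("Modular System:", modular_external_bw_gbps(x, n), "Gbps")
    print("2PE:", npe_external_bw_gbps(x, n), "Gbps")
```

With these example values, two 100 Gbps PEs give 100 Gbps of external bandwidth in a Modular System and 200 Gbps as a 2PE.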

An NPE device can (at any point in time) be split into its individual PE components, which can then operate individually. This allows cost effective construction of redundant hardware for certain redundancy scenarios. As an example, take the case of N=2: This will provide a graceful degradation to 50% of the 2PE bandwidth in case one device fails. Through the use of this feature, it becomes possible (at low or zero cost) to design networks and network elements which will continue to work, in case of a hardware failure, but the capacity will be reduced to (N−1)/N of the full NPE capacity. In order to achieve this, the N separate PE devices sit on separate line cards, such that a faulty PE device can be replaced, while the other PE devices continue to operate.
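The degradation figure can be checked with a one-line calculation. This is a minimal sketch; the function name and the `failed` parameter are hypothetical, and it assumes each of the N PEs contributes an equal share of capacity.

```python
from fractions import Fraction


def remaining_capacity(n: int, failed: int = 1) -> Fraction:
    """Fraction of full NPE capacity left after `failed` of n PEs fail."""
    if not 0 <= failed <= n:
        raise ValueError("failed must be between 0 and n")
    return Fraction(n - failed, n)


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, "PEs, one failure:", remaining_capacity(n))
```

For N=2 this gives 1/2, matching the 50% figure in the text.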

This invention combines a number of technologies, such as bit slicing, to achieve the NPE goal, but combines these in a novel way to create the NPE capable device.

FIG. 3 shows the block diagram of a single PE device. The main blocks in the drawing:

-   Ingress traffic interfaces (300)
-   Control Path (320)
-   Buffer system (330)
-   Egress Queues & scheduler (340)
-   Egress traffic interfaces (370)

As can be seen from FIG. 3, traffic enters the PE from the ingress interfaces (300). The packets are written to the buffer memory (330), for temporary storage, as well as sent to the control path (320), for lookup, classification, policing, etc. When the control path has finished processing a given packet, the control path commands (380) the buffer memory (330) that the packet should either be discarded (packet memory freed again), or enqueued (350) in one or more egress queues (340). Multicast requires sending a packet to more than one egress queue. The egress scheduler (340) reads packets from the queues and transmits them on the egress interfaces (370). When the last copy of a packet has been transmitted, the packet memory is freed again.
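The single-PE packet lifecycle described above (store, enqueue or discard, transmit, free after the last copy) can be sketched as follows. This is a simplified illustration, not the actual hardware design: the class and method names are hypothetical, and a per-packet copy counter is assumed as the way to detect the last transmitted copy.

```python
from collections import deque


class SinglePE:
    """Toy model of the buffer and egress queue flow of FIG. 3."""

    def __init__(self, num_buffers: int, num_queues: int):
        self.free_list = deque(range(num_buffers))  # free buffer pointers
        self.data = {}    # pointer -> stored packet
        self.copies = {}  # pointer -> copies still to be transmitted
        self.queues = [deque() for _ in range(num_queues)]

    def receive(self, packet: bytes) -> int:
        """Write an ingress packet to buffer memory and return its pointer."""
        ptr = self.free_list.popleft()
        self.data[ptr] = packet
        self.copies[ptr] = 0
        return ptr

    def control_result(self, ptr: int, egress_queues: list) -> None:
        """Apply the control path decision: discard, or enqueue one or more copies."""
        if not egress_queues:
            self._free(ptr)  # discard: packet memory freed again
            return
        self.copies[ptr] = len(egress_queues)  # more than one means multicast
        for q in egress_queues:
            self.queues[q].append(ptr)

    def transmit(self, queue: int) -> bytes:
        """Send the next packet of a queue; free memory after the last copy."""
        ptr = self.queues[queue].popleft()
        packet = self.data[ptr]
        self.copies[ptr] -= 1
        if self.copies[ptr] == 0:
            self._free(ptr)
        return packet

    def _free(self, ptr: int) -> None:
        del self.data[ptr]
        del self.copies[ptr]
        self.free_list.append(ptr)
```

In this model a multicast packet enqueued on two queues occupies one buffer, and that buffer returns to the free list only after the second `transmit` call.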

FIG. 4 shows a 2PE (dual-PE) block diagram. The 2PE operation is very similar to a single PE operation. The description here covers what goes on in one of the PE chips, but the same goes on in the other PE chip, with very few modifications (described in more detail below). Traffic enters from ingress interfaces (400, 401), and is sent both to the control path (420, 421) and stored in the buffer memory (430, 431). Packets entering each PE (410, 411) get stored in both buffer memories (430, 431) using a bit slicing technique over the Chip-to-Chip (C2C) interface (490). Since each control path (420, 421) handles only packets entering from local traffic interfaces (400, 401), they split the work between them perfectly. The buffer system uses a common bit-sliced memory, created by combining the memory interfaces on both chips. Effectively, this results in 50% of packet bits being stored in memory associated with each chip. Each chip owns and controls exactly 50% of the shared buffer memory, and has its own free list for buffer maintenance. When the control path on a PE chip has finished processing the packet, the result might be that the packet needs to be enqueued (450, 451) either on one or more local egress queues (on the same PE), or enqueued (491, 492) in egress queues on the other PE. In case of local enqueues, the enqueue operation is straightforward, and very similar to single PE operation. In case of a “remote” enqueue (from one chip to the other chip), the enqueue request (482) is sent to the remote queue system over the C2C (490) bus, together with a packet pointer, which points to the packet in the shared buffer system. No packet data is transferred in this operation, because the packet is already accessible to both devices in the bit-sliced buffer memory.
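To make the bit-slicing idea concrete, the following sketch splits the bits of a packet across N memories and reassembles them. The round-robin assignment (bit i goes to memory i mod N) is an assumption made for illustration; the text only states that 50% of the packet bits end up in the memory of each chip in the 2PE case. All function names are hypothetical.

```python
def to_bits(data: bytes) -> list:
    """Expand bytes into a list of bits, most significant bit first."""
    return [(byte >> (7 - i)) & 1 for byte in data for i in range(8)]


def from_bits(bits: list) -> bytes:
    """Pack a list of bits (length a multiple of 8) back into bytes."""
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | b
        out.append(byte)
    return bytes(out)


def bit_slice(data: bytes, n: int = 2) -> list:
    """Distribute packet bits over n memories: bit i goes to memory i % n."""
    bits = to_bits(data)
    return [bits[i::n] for i in range(n)]


def bit_unslice(slices: list) -> bytes:
    """Read all slices back and interleave them into the original packet."""
    n = len(slices)
    bits = [0] * sum(len(s) for s in slices)
    for i, s in enumerate(slices):
        bits[i::n] = s
    return from_bits(bits)


if __name__ == "__main__":
    packet = b"hello"
    chip0, chip1 = bit_slice(packet, 2)
    print(len(chip0), len(chip1))           # equal halves of the packet bits
    print(bit_unslice([chip0, chip1]) == packet)
```

Because every packet is spread over both memories, either chip can read any packet, which is why a remote enqueue only needs to carry a pointer.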

Egress transmission on both chips is straightforward: The packets are read from the bit-sliced memory (effectively reading from memories on both PEs), and transmitted on the egress interfaces (470, 471). However, when a complete packet has been transmitted, the buffer system on a PE does two different things, depending on whether the packet originated from itself or not. If the packet originated from the same PE, it informs the buffer manager on the same PE (460, 461) that this packet copy is no longer needed, and the buffer manager keeps track of when the last copy has been sent, so that the memory can be returned to the free list (for this local PE chip). If the packet originated from the other PE chip, it informs the buffer manager on the other PE chip (493, 494), via the C2C interface, that this copy is no longer needed. In this way, the buffer manager on each PE chip maintains full control over the memory it owns (50%), regardless of the ingress/egress traffic patterns across the two PE chips.
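The ownership rule for buffers in the 2PE case, together with the pointer-only remote enqueue, can be sketched as below. This is a toy model under stated assumptions: pointers are (owner chip, index) tuples, the C2C bus is represented by direct method calls plus a message log, and all names are hypothetical.

```python
from collections import deque


class PEChip:
    """Toy model of one chip in a 2PE pair (FIG. 4), tracking buffer ownership."""

    def __init__(self, chip_id: int, num_buffers: int):
        self.chip_id = chip_id
        # Each chip owns its own half of the shared buffer and its own free list.
        self.free_list = deque((chip_id, i) for i in range(num_buffers))
        self.copies = {}           # owned pointer -> copies not yet transmitted
        self.egress_queue = deque()
        self.peer = None           # the other chip, set after construction
        self.c2c_log = []          # messages this chip sent over the C2C bus

    def ingress(self, local_copies: int, remote_copies: int):
        """Allocate a buffer for an ingress packet and enqueue its copies."""
        ptr = self.free_list.popleft()
        total = local_copies + remote_copies
        if total == 0:             # discard: memory freed again
            self.free_list.append(ptr)
            return ptr
        self.copies[ptr] = total
        for _ in range(local_copies):
            self.egress_queue.append(ptr)
        for _ in range(remote_copies):
            # Remote enqueue: only the pointer crosses the C2C bus.
            self.c2c_log.append(("enqueue", ptr))
            self.peer.egress_queue.append(ptr)
        return ptr

    def transmit(self):
        """Transmit one packet, then report the finished copy to the owner."""
        ptr = self.egress_queue.popleft()
        if ptr[0] == self.chip_id:
            self.copy_done(ptr)
        else:
            self.c2c_log.append(("copy_done", ptr))
            self.peer.copy_done(ptr)
        return ptr

    def copy_done(self, ptr) -> None:
        """Owner side: return the buffer to the free list after the last copy."""
        self.copies[ptr] -= 1
        if self.copies[ptr] == 0:
            del self.copies[ptr]
            self.free_list.append(ptr)
```

In this model a buffer always returns to the free list of the chip that allocated it, regardless of which chip transmitted the last copy.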

As described above, the C2C interface performs memory bit slicing and carries “remote enqueue” operations and “remote dequeue” (packet copy no longer needed) operations. A number of other protocols also run over the C2C bus, which include:

1) Policing. In order to support shared policers across a dual-PE system, all policing buckets are kept (maintained) on one of the chips, called a police master. The other chip (police slave) performs policing operations by sending information (policer number, packet length, etc.) to the police master, over the C2C, and receives the police answer (primarily packet color: red, yellow, green), again, over the C2C. In this way, flows ingressing on both chips can share the same policer (or have individual policers), just as required.
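The master/slave policing exchange can be sketched as follows. The text does not specify the policing algorithm, so the two-bucket color marker below (with no token refill over time) is purely an assumption for illustration, as are all class and method names; the C2C request and answer are represented by a direct method call.

```python
class PoliceMaster:
    """Holds all policing buckets for both chips."""

    def __init__(self):
        # policer number -> [committed tokens, excess tokens], in bytes
        self.buckets = {}

    def add_policer(self, policer: int, committed: int, excess: int) -> None:
        self.buckets[policer] = [committed, excess]

    def police(self, policer: int, packet_length: int) -> str:
        """Return the packet color and consume tokens from the matching bucket."""
        bucket = self.buckets[policer]
        if bucket[0] >= packet_length:
            bucket[0] -= packet_length
            return "green"
        if bucket[1] >= packet_length:
            bucket[1] -= packet_length
            return "yellow"
        return "red"


class PoliceSlave:
    """Keeps no buckets; forwards every policing request to the master."""

    def __init__(self, master: PoliceMaster):
        self.master = master

    def police(self, policer: int, packet_length: int) -> str:
        # Stand-in for: send (policer number, packet length) over the C2C,
        # then receive the color answer over the C2C.
        return self.master.police(policer, packet_length)
```

Because both chips draw from the same bucket state, a flow arriving on the slave chip and a flow arriving on the master chip can share policer 7 in the same way two flows on a single chip would.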

2) OAM packet handling. For certain protocols like Multiprotocol Label Switching (MPLS), the ingress interface of an MPLS Tunnel can suddenly change without warning. This does not present any problem for data packets that need to be forwarded, but for connectivity check OAM packets (packets sent at fixed intervals to allow detection of a faulty link), it means that these need to be handled by a central agent, spanning both PEs. In such 2PE operation, one PE is an OAM master, and the other PE is an OAM slave. The OAM slave PE chip informs the master PE (over the C2C bus) that an OAM packet has arrived on a particular link. In this way, the OAM master is always informed about OAM packet arrival, regardless of which interface/chip the packet arrives on, and is able to perform the “loss of connectivity” check in a straightforward fashion, just as if it was done on a single chip.
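A minimal sketch of the central “loss of connectivity” check is given below. The timeout rule (a tunnel is declared lost when no connectivity check packet has been seen for longer than a configured time) and all names are assumptions for illustration; the text only states that the master is informed of every OAM arrival regardless of the receiving chip.

```python
class OamMaster:
    """Central agent tracking connectivity check arrivals for both chips."""

    def __init__(self, timeout: float):
        self.timeout = timeout  # hypothetical: max silence before declaring loss
        self.last_seen = {}     # tunnel -> (arrival time, receiving chip)

    def oam_arrival(self, tunnel: str, now: float, chip: int) -> None:
        """Record an OAM packet, whether it arrived locally or was reported over C2C."""
        self.last_seen[tunnel] = (now, chip)

    def lost_tunnels(self, now: float) -> list:
        """Return the tunnels whose last OAM packet is older than the timeout."""
        return sorted(
            tunnel
            for tunnel, (seen, _chip) in self.last_seen.items()
            if now - seen > self.timeout
        )


if __name__ == "__main__":
    master = OamMaster(timeout=3.5)
    master.oam_arrival("tunnel-A", now=0.0, chip=0)
    master.oam_arrival("tunnel-A", now=1.0, chip=1)  # ingress moved to the other chip
    print(master.lost_tunnels(now=4.0))              # tunnel-A still considered up
```

A change of ingress chip does not reset or split the state, since both chips feed the same table.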

3) Central Processing Unit (CPU). In a 2PE operation, each PE may reside on a different line-card. Usually, each line-card has its own CPU for performing software related functions and protocols. The C2C interface in a 2PE operation permits the two CPUs of corresponding line-cards to communicate with each other, over the C2C interface. With proper software, the two CPUs could be synchronized regarding the information about both cards, and in case of failure of one of the CPUs, the other one can take over the control and operation of both line-cards.

As the carrier class protocols evolve, there will likely be more communication going on over the C2C bus to maintain the monolithic view across both chips, but the memory bit slicing and remote enqueue/dequeue are by far the largest bandwidth users on the link now, and will likely continue to be so, in the future.

Note that N can be greater than or equal to 2 in an NPE system. The bit slicing protocol scales very nicely to solutions with N>2. However, other protocols described above do not scale linearly. For example, the police master will need to handle the policing operations for all PE chips (to support shared policers across any combination of PE chips), which does not scale very well. Therefore, an NPE system with N>2 still improves overall performance, but the improvement does not scale linearly with N.

Any variations of the above teaching are also intended to be covered by this patent application.

1. A system for processing and switching of packets in a communication network, said system comprising: a communication medium; and one or more devices; multiple general interfaces; each one of said one or more devices comprising: a buffer; a packet processing engine; multiple queues; one or more queue schedulers; multiple ports; and a chip-to-chip interface; wherein said buffer comprises one or more internal or external random access memories; wherein packets entering said multiple ports of a first of said one or more devices are stored in said buffer of said first of said one or more devices; wherein said packet processing engine comprises one or more of the following: a packet classifier, a packet filter, a switch, a policer or an operation and maintenance engine; wherein said switch identifies one or more of said multiple queues for enqueue operation; wherein one or more of said multiple queues are assigned to said multiple general interfaces; wherein said one or more queue schedulers are assigned to said multiple general interfaces; wherein said one or more queue schedulers are in charge of scheduling packets for transmission and subsequent dequeue operation out of said multiple queues; wherein said chip-to-chip interface of said first of said one or more devices is connected to rest of said one or more devices; wherein said chip-to-chip interface of said first of said one or more devices is used for communicating information between said first of said one or more devices and said rest of said one or more devices; and wherein said chip-to-chip interface of said first said one or more devices is used to store said packets in said buffer of said rest of said one or more devices.
2. A system as recited in claim 1, wherein said chip-to-chip interface of said one or more devices is used to distribute a storage of said packets entering said first of said one or more devices equally among said buffer of all of said one or more devices.
3. A system as recited in claim 2, wherein said packets are distributed in bit-sliced fashion equally among said buffer of all of said one or more devices.
4. A system as recited in claim 2, wherein said packets are distributed in byte-sliced fashion equally among said buffer of all of said one or more devices.
5. A system as recited in claim 2, wherein said packets are distributed in word-sliced fashion equally among said buffer of all of said one or more devices.
6. A system as recited in claim 1, wherein said first of said one or more devices asks a second of said one or more devices to enqueue said packets in said multiple queues of said second of said one or more devices.
7. A system as recited in claim 1, wherein said first of said one or more devices asks a second of said one or more devices to dequeue said packets from said multiple queues of said second of said one or more devices.
8. A system as recited in claim 1, wherein said policer of said first of said one or more devices performs a policing function for said rest of said one or more devices.
9. A system as recited in claim 1, wherein said operation and maintenance engine of said first of said one or more devices performs an operation and maintenance function for said rest of said one or more devices.
10. A system as recited in claim 1, wherein multiple component links are associated with said multiple ports of said one or more devices.
11. A system as recited in claim 10, wherein said first of said one or more devices performs a link aggregation without interruption, by removing said one or more of said multiple component links from a link aggregation group, when one or more of said multiple component links have a failure.
12. A system as recited in claim 1, wherein said first of said one or more devices gracefully recovers and operates normally, when said chip-to-chip interface of said first of said one or more devices has a failure.
13. A system as recited in claim 1, wherein said first of said one or more devices gracefully recovers and operates normally, when said rest of said one or more devices have a failure.
14. A system as recited in claim 1, wherein X number of said one or more devices are connected to each other, via said chip-to-chip interface of said X number of said one or more devices, behaving as a single instance of said one or more devices that switches traffic from any of said multiple ports of said X number of said one or more devices to rest of said multiple ports of said X number of said one or more devices.
15. A system as recited in claim 14, wherein said X number of said one or more devices reside on the same line-card.
16. A system as recited in claim 14, wherein said system has a total processing power, in terms of bits per second, that is equal to X times of a processing power of one of said X number of said one or more devices.
17. A system as recited in claim 14, wherein said system has a total processing power, in terms of packets per second, that is equal to X times of a processing power of one of said X number of said one or more devices.
18. A system as recited in claim 14, wherein said X number of said one or more devices operate with one or more external central processing units that run a required data processing software.
19. A system as recited in claim 18, wherein said X number of said one or more devices continue normal operation, when at least one of said one or more external central processing units is operational.
20. A system as recited in claim 14, wherein said X number of said one or more devices reside on the different line-cards.